Estonian Wordnet: Current State and Future Prospects
Abstract
This paper presents Estonian Wordnet (EstWN) and its latest developments. We focus on the period 2011–2017, during which the EstWN project was supported by the National Programme for Estonian Language Technology (NPELT). We describe the goals set at the beginning of 2011 and what has been accomplished since. The paper thus serves as a summary report on the progress of EstWN during this programme. While building EstWN we have aimed to make it both a valuable Estonian resource and one that is compatible with a common multilingual framework.

1 Estonian Wordnet: Project Progress

Estonian Wordnet is a lexical-semantic resource describing Estonian words and their lexical relationships. The history of EstWN began in 1998, when the Estonian team joined the EuroWordNet (EWN) project (see also Vossen 1998). In 1998 the only available example was Princeton WordNet (PWN) (Fellbaum 1998), so the EWN project followed the same principles. EWN added a completely new component, multilinguality: the possibility of linking different languages via a central InterLingualIndex (ILI), which at that time was based on PWN version 1.5.

At the beginning of 2011 EstWN had reached around 40 000 concepts (including 10 000 synsets taken over automatically); by September 2017 it contained around 85 000 concepts with 230 664 semantic relations and 135 497 senses. Over the years the EstWN project has been supported mainly by the National Programme for Estonian Language Technology¹: the first programme ran from 2006 to 2010 and the second from 2011 to 2017. We greatly appreciate that the Estonian government has realized that it is crucial to support the creation of Estonian language resources so that the Estonian language is able to survive in the digital world among the larger languages.

¹ National Programme for Estonian Language, https://www.keeletehnoloogia.ee/en.
There are two main directions in the EstWN project: adding new and missing concepts, and improving the quality of existing data, for example by systematically revising English equivalents and semantic relations or by complementing EstWN with extra information such as sentiment, domain (see Bentivogli 2004) etc. Recently some wordnets have employed sentiment (opinion) information, and in EstWN 57 000 synsets have been automatically annotated with SentiWordNet data (see Baccianella et al. 2010). In addition to SentiWordNet, we have incorporated sense-annotated vocabulary from a dictionary made for emotion detection (this vocabulary is manually tagged by linguists, see Pajupuu et al. 2016). Besides the negative-positive-neutral scale, this vocabulary also has a contradictory tag: for example, emotional and receptive can be either positive or negative, depending on context. In the future, we plan to obtain sentiment tags for all synsets in the latest version of EstWN.

In the long run, we expect EstWN to be used more frequently both as a language technology resource and in linguistic studies. Another important goal is to become part of a unified global linguistic data infrastructure. While building EstWN we still follow the general PWN principles and structure to enable linking, but at the same time EstWN should remain as language-specific as possible.

1.1 Where do new synsets come from?

Our team started compiling EstWN by translating base concepts and then extended it with knowledge from different lexicons, corpora etc. Since EstWN has mostly been built manually by different people, the semantic relations largely reflect human subjectivity. We have included vocabulary from dictionaries such as the Estonian Explanatory Dictionary, the Orthological Dictionary and various terminology dictionaries, as well as from word frequency lists of corpora of written Estonian.
Since the general vocabulary of Estonian is covered, we have moved on to specialized terminology. Although Martin Benjamin (2017) has written that “too many specialist terms would make PWN so unwieldy that the resource would become dysfunctional for users trying to sift through numerous esoteric senses”, we continue to add vocabularies from different domains so that EstWN can be used more broadly. Several students have also contributed their bachelor's thesis work to EstWN: for example, the vocabulary of veganism, climate, transportation etc. has been studied in depth, and the semantic relations within the chosen vocabulary have been thoroughly examined. The computer game Alias, which draws its information from EstWN, is also a useful source of feedback on new and missing words and senses (we discussed it at the previous conference, Aller et al. 2016).

1.2 Automatically generated synsets

At some point during the project, it seemed sensible to construct part of the resource automatically. Before 2011, only a few attempts had been made to enlarge the database (semi-)automatically. We have to admit that these attempts were not particularly successful, and there are still problems to deal with. First, we included words that were missing from the word sense disambiguation corpus, but ended up with many proper names and words that already belonged to an existing synset. Then synsets from the Dictionary of Synonyms were transferred automatically, but they needed many corrections because the distinction between synonym and near-synonym was not clearly visible. A lot of dialectal and archaic words were also included, but not systematically or consistently.

Ideally, we would want to have broad vocabulary coverage. That was the reason for our attempt to add nominalizations automatically, especially words with the suffixes -ja (equivalent to the English suffix -er) and -mine (equivalent to the English suffix -ing). In this way, almost 10 000 synsets were added.
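To make the idea of automatic nominalization concrete, the following is a minimal sketch of a purely string-based derivation. It is an illustration only, not the procedure actually used for EstWN: the function name `naive_nominalizations` is hypothetical, and the code assumes that the verb is given in its -ma infinitive and that the noun is formed by a plain suffix swap, ignoring any stem alternations.

```python
def naive_nominalizations(verb_ma: str) -> dict[str, str]:
    """Derive -mine and -ja nouns from an Estonian -ma infinitive.

    Hypothetical helper: a plain suffix swap that ignores stem changes
    and says nothing about the senses of the resulting nouns.
    """
    if not verb_ma.endswith("ma"):
        raise ValueError(f"expected a -ma infinitive, got {verb_ma!r}")
    stem = verb_ma[:-2]  # drop the infinitive ending "ma"
    return {
        "action": stem + "mine",  # roughly English -ing
        "agent": stem + "ja",     # roughly English -er
    }


print(naive_nominalizations("andma"))
# {'action': 'andmine', 'agent': 'andja'}
```

A sketch like this yields only surface forms. It does not indicate which senses of the base verb the derived noun actually has, nor does it provide a definition or a noun hypernym for the new synset.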
Unfortunately, very many of these derivations are not valid, because one internal and one external relation were generated automatically for each: an internal xpos_hypernym relation linking to a verb and an external equal_hyperonym relation to a verb. This led to a confusing situation, because neither relation is accurate and, more importantly, both link only to another part of speech, which does not follow the principles of wordnet. For example, the verb synset ‘say, state, tell’ automatically received several xpos_hyponyms (all of the following synsets are nouns):

lisamine, täiendamine ‘adding’
andmine ‘giving’
deklareerimine, kuulutamine ‘declaring’
hõikamine, hõiskamine ‘whooping’
protestimine ‘protesting’
esitamine ‘presenting’
kordamine ‘repeating’
vastamine ‘answering’

Another problem occurred while transferring these derivations into EstWN: although the verb serving as the derivation base can have multiple senses, the derived nouns with the suffixes -mine and -ja do not share all of them, either syntactically or semantically. For example, the word andma ‘to give’ has 14 senses in EstWN, but the derivations andmine ‘giving’ and andja ‘giver’ are used in only some of these 14 senses. Revising the automatic derivations is quite challenging, since they also lack definitions. We still deal with these derivations manually: we either fix the set of relations and add definitions, or delete the invalid concepts completely.

Because of the rich morphology of Estonian, many derivations are possible, such as adverbs, which are easily derived from other word classes, for example ahne ‘greedy’ – ahnelt ‘greedily’ (Kerner et al. 2010). However, the experiments described above have made us cautious about fully automatic enlargement, since manual correction is unreasonably time-consuming. Of course, we are open to implementing proven automatic extension methods that measure up to the quality of manual work.
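The manual revision of automatic derivations could be supported by a simple filter that collects the candidates to check. The sketch below flags noun synsets that have no definition and no relation to another noun, for instance those attached to a verb only through xpos_hypernym. The `Synset` class, its fields and the sample entries are hypothetical and invented for illustration; they are not the EstWN data model or actual EstWN records.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """Hypothetical, simplified synset record (not the EstWN schema)."""
    lemmas: list[str]
    pos: str                      # "n" for noun, "v" for verb
    definition: str = ""
    # Each relation is (relation name, part of speech of the target synset).
    relations: list[tuple[str, str]] = field(default_factory=list)


def needs_manual_review(s: Synset) -> bool:
    """Flag noun synsets without a definition and without any noun-to-noun relation."""
    if s.pos != "n" or s.definition.strip():
        return False
    # True when every relation crosses part of speech (or there are none at all).
    return all(target_pos != s.pos for _, target_pos in s.relations)


# Invented sample data: one unrevised automatic derivation, one revised entry.
auto = Synset(["andmine"], "n", relations=[("xpos_hypernym", "v")])
revised = Synset(["andja"], "n", "person who gives something", [("hypernym", "n")])

to_check = [s.lemmas[0] for s in (auto, revised) if needs_manual_review(s)]
print(to_check)
# ['andmine']
```

Such a filter only narrows down where to look. Deciding whether to fix the relations and add a definition or to delete the concept remains a manual judgment.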
1.3 How to define synsets – general challenges

It is widely known that definitions are difficult to write and take a lot of time even in one's mother tongue, yet they provide clarity for both native speakers and foreigners (Benjamin 2017). Because many synsets in EstWN lack definitions, we have to provide proper ones where possible. The problem with definitions originates in the existing dictionaries of Estonian, where we find a lot of tautology, i.e. unnecessary repetition of meaning. None of the dictionaries we have used contains information about the hierarchy of concepts. The explanatory dictionary gives the hypernym (or synonyms, near-synonyms or antonyms) of some headwords in their definitions, but this information is unfortunately unsystematic and can be rather confusing.

In Estonian, it is possible (and common) to rephrase concepts with compound words, since the patterns of compound formation are productive in Estonian (Kerge 2016). The problem of tautology arises again if a synset contains a compound word, for example hüpertoonia+haige ‘hypertonia+sick person’, hüpertoonik, kõrgvererõhu+haige: ‘person who suffers from hypertonia’. A good definition is meant to paraphrase the concept, but the tools (i.e. words) seem to be missing.

Lew (2015) has pointed out that, surprisingly, people look up the explanation of a meaning first through synonyms, so in some cases it might be more helpful to pay attention to the synset members than to a (bad) definition. Similarly, work on the Estonian Text Simplification application (Peedosk 2017) showed that, for a better understanding of a concept, it is essential to be able to choose between a foreign word and a native word (encephalitis vs. ajupõletik ‘inflammation of the brain’, or kõht ‘belly’ vs. abdoomen ‘abdomen’).
Native words are often more informative to native speakers, whereas a foreign word is understandable to foreigners (who can, through the foreign word, learn and understand the native word).

2 EstWN odyssey from ILI1.5 to PWN3.0